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Abstract 

The Monte Carlo simulation of the dynamics of complex molecules produces 
trajectories with a large number of different configurations to sample configu- 
ration space. It is expected that these configurations can be classified into a 
small number of conformations representing essential changes in the shape of the 
molecule. We present a method to visualize these conformations by point sets in 
the plane based on a geometrical distance measure between individual configu- 
rations. It turns out that different conformations appear as well-separated point 
sets. The method is further improved by performing a cluster analysis of the 
data set. The point-cluster representation is used to control a three-dimensional 
molecule viewer application to show individual configurations and conformational 
changes. The extraction of essential coordinates and visualization of molecular 
shape is discussed. 



I Introduction 

Molecular dynamics simulations on large computers have become one of the main- 
stays for investigating the functions of biomolecules. Using statistical algorithms, they 
create a large number of snapshots of the molecule that approximate the expected 
distribution of molecular shapes in actual molecular processes. By looking for typ- 
ical shapes (conformations) and transition paths in this large data set, biochemists 
can learn about the molecular bases of biochemical processes. Such understanding is 
important in particular in designing more efficient medical drugs. 

Identifying typical shapes in such a large data set is itself a difficult task |]TJ. In most 
simulation, one focuses on a few characteristic numbers, like the angles or distances 
between specific atoms in the molecule, and monitors the change of these quantities in 
the simulation. This approach requires some advance knowledge about which parts of 
the molecule are important to the dynamics, and makes it also difficult to perform the 
analysis automatically. 

To use all the information computed in a molecular dynamics simulation, one 
must leave the analysis step again to a computer. We present here two procedures 
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to aid in this task: a projection method to visualize the molecular configurations of a 
trajectory as a point set in the plane, and a cluster analysis to identify clusters of similar 
configurations in the trajectory. These methods can be applied automatically to any 
molecular dynamics trajectory and result in a tentative identification of conformations 
in the trajectory. 



II Procedure 

II. 1 Configurations and feature vectors 

The output of a molecular dynamics simulation is a trajectory, i.e. sequence of configu- 
rations that depicts the evolution of the molecule in time. If the molecule consists of n 
atoms, the configuration x is described by n 3-dimensional vectors Xj e R 3 that spec- 
ify the cartesian position of each atom in 3-dimensional space. Other information, in 
particular which atoms are connected by chemical bonds, is a property of the molecule 
and usually does not change during the simulation. 

To classify the configurations, we must quantify how much their geometries differ. 
We thus assign to each configuration a feature vector that describes the geometry of 
the configuration is such a way that similar configurations have similar feature vectors. 
However, the set of 3n numbers that make up the cartesian positions of the atoms is 
unsuited as identical geometries can appear with different rotations and translations. 
While the translational freedom can easily be fixed by requiring that the center of mass 
conincides with the origin of the coordinate frame, the rotational degree of freedom is 
extremely difficult to eliminate. Fixing the axis of inertia can lead to sudden articifical 
rotations when the axes become degenerate, while fixing certain atoms to the coordinate 
axes always introduces a undesirable bias. 

A feature vector that is invariant under translations and rotations and does not 
introduce any bias can be chosen by considering the set of intramolecular distances ]2[ 

{dij(x) = |xi - Xj|, i,j e 1, ...,n} . (1) 

The price to pay is that instead of 3n elements, this vector has now n(n — l)/2, but 
geometrically identical configurations have identical feature vectors, and the cartesian 
distance in n(n — l)/2-dimensional space is a natural measure of conformational dis- 
tance: 



d(x,y) 



\ n(n - l)/2 



1 ^(d^-d^y)) 2 . (2) 



i>j 



Another frequent way to choose a feature vector is to use the dihedral angles 
between certain atoms as basic degrees of freedom. This is a natural choice as dihedral 
angles are the main degrees of motion in the simulations (atomic distances and bond 
angles are usually much more rigid). However, the potential energy that determines 
the dynamics of the molecule depends on the spatial distance of the atoms, and the 



3 



relation between dihedral angles and spatial distances is involved at best. We prefer 
here to put all information as unbiased as possible to the algorithm and depend on it 
to extract the relevant degrees of freedom. 

The feature vector ([I]) is vast compared to the number of degrees of freedom (in 
our example molecule, it has 2415 elements as compared to 70 x 3 — 6 = 204 degrees 
of freedom). Some of its elements will show little or no fluctuation (e.g. the ones 
associated to the lengths of chemical bonds), others will fluctuate thermally, and still 
others will assume different values in different fluctuations and thus exhibit a double- 
peaked distribution. To reduce the thermal noise, we analyze the elements of the 
feature vector statistically and select those whose distribution has the largest width. 
As thermal fluctuations are smaller than the conformational changes, this also selects 
the distances most affected by conformational changes. A similar procedure has been 
used by || to identify essential degrees of freedom in cartesian coordinate space. 



II. 2 Low-dimensional approximations 

The feature vector space is by far too large to be visualized directly. To capture the 
major properties of the point set that represents a trajectory in this space, we seek to 
visualize it in a plane, i.e. to assign each configuration a point in the two-dimensional 
plane such that the geometrical similarity between configurations is reproduced as 
faithfully as possible. After having chosen a distance measure (Q) on the trajectory, 
this reduces to the general problem of visualizing an arbitrary distance matrix Dij 
between a set of N configurations, where i and j now number configurations. 

One choice is to require that the mean quadratic deviation of the conformational 
distance from the distance in the plane, given by 

D^^d^-x.-l-Ai) (3) 

i>j 

is minimized by the choice of the points Xj, i.e. that the derivative of the quantity with 
respect to the position of the k-th point 

^ = Et^^7 (|x fc -x i |-d tt ) = (4) 

vanishes. This equation can be pictured physically by a set of springs that connect the 
points and whose natural length is given by the desired distance between the points. 

We solve the minimum problem of ([$]) numerically by the conjugate-gradient method. 
Though there is no guarantee that the minimum found by this method is the global 
one, the minimization takes place in a 2iV-dimensional space where it is improbable 
that a false minimum is stable in all directions. An example of this is the situation 
where we have a solution of Eq. (^) in D — 1 dimensions and then extend the solution 
space to D dimensions by setting x^d = for all i, which still satifies Eq. (f|). How- 
ever, this minimum (in D — 1 dimensions) now turns out to be a saddle point in D 
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dimensions, where the second derivative of D 2 is 

g 2 ( d fc ,-|x fc -x,| ifk^l 

ox KD ax hD y - 2^ fc ]Xfe _ x .| 
In a true minimum, this quantity is positive, thus requiring that 

t4z > |Xfc — x;| for all fe, I (6) 

but also 

E dfci i~ |Xfc ~| Xtl <° forallfc • (7) 



This will happen only if the first inequality is an equality, i.e. if the solution is complete. 

Another widely used low-dimensional approximation is based on the singular- value 
decomposition (SVD) of the feature matrix ||. Let be the feature matrix of % = 
1, . . . , n objects with j = 1, . . . , m features each. (In our example, n is the number of 
configurations while m is the number of intramolecular distances.) The singular- value 
decomposition expresses this matrix as a series 

a v = hUi'vf' (8) 

k 

where and v^ fc ^ are n- and m-dimensional, resp., orthonormalized basis vectors, 
and Afc gives the weight of fc-th term. The number of terms in the series is the rank of 
the matrix, it is at most the lower of n and m. 

The relation between singular-value decomposition and point sets in low-dimensional 
space can be seen by calculating the distance between feature vectors in terms of the 
SVD: 



Dij — ~^2(aik — ajk)' 



k 



= E (!>(«< 

k \ I 

= E*i («?>-«?>)' < 9 > 

k 

when orthonormality of v ^ is taken into account. Thus the vectors 

{X k u^ :k = l,...,m} (10) 

can be interpreted as specifying the cartesian positions in m-dimensional space of the 
i-th data point. When we chose in decreasing order, truncating the series after 
singular-value decomposition after I terms will lead position vectors in /-dimensional 
space that are best approximations in a linear sense. 

The major difference between the two approaches is that the SVD performs the 
approximation is a linear manner: When the dimension of the approximation space is 
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decreased from D to D — 1, the new approximation is simply obtained by orthogonally 
projecting out the last coordinate. In contrast, in the approximation obtained from 
minimizing @, the nonlinearity introduced by the square root redistributes some of 
the "lost" distance in the remaining dimensions. 



II. 3 Cluster analysis 

Cluster analysis || H is a statistical method to partition the point set into disjoint 
subsets with the property that the points in a subset are in some sense closer to 
each other than to the remaining points. There are several different ways to make 
this statement mathematically precise. We choose the notion of minimum residual 
similarity between clusters which leads to a natural formulation of the problem in terms 
of eigensystem analysis and to a heuristic algorithm for its solution. This spectral 
method goes back to works by Donath and Hoffmann |8[] on graph partitioning 
in computer logic and Fiedler P, |TD[ on acyclic graphs and was later picked up by 



Hendrickson [|16[]. Other cluster analysis methods based on neural networks or fuzzy 
clustering have also been applied to molecular dynamics simulations Jy], [T^| . 

Amadei et.al. went further by introducing the concept of essential dynamics in 
which the coordinate space of the molecule is split into a small essential subspace and 
a larger non-essential subspace. They assumed a linear factorization of the coordinate 
space and identified essential coordinates by large second moments of their distribution, 
assuming that these distributions are mainly non-Gaussian double-peaked shapes. 

To be as flexible as possible, we assume that a similarity measure 

< dij < 1, 1 < % < n (11) 

is given between the n data points, where Oy = indicates complete dissimilarity and 
ciij = 1 complete identity of configurations i and j. The residual similarity of a cluster 
C C {1, . . . , n} characterizes how similar elements of the cluster are to elements outside 
the cluster 

R(C)= £ Aij . (12) 

We wish to partition the data set into two subsets such that this quantity is minimized. 
Let Oj the characteristic vector of this partition, with value = 1 indicating that i G C, 
and otherwise a; = —1. Then the residual similarity can be rewritten 

r(c) = ^E(^-%) 2 ^ 




Aij cij 



(13) 



with the Laplacian matrix 



6 



To find the minimum of the expectation value (a, Ma) over the vectors that have 
element ±1 only, is a hard combinatorial problem. However, if we relax the problem 
and allow real values for the a\ with the constraint \a\ = 1, the problem is exactly 
the problem of finding the second-lowest eigenvector of the matrix M (the lowest 
eigenvector corresponds to the solution aj = 1 that does not lead to a proper partition). 
Since eigenvectors are orthogonal, the second-lowest eigenvector satisfies 

J>-=1 and ^O; = (15) 

i i 

and thus guarantees that it will contain both positive and negative eigenvalues. In 
graph theory, this eigenvector is called the characteristic valuation of a graph. 

Low-lying eigenvectors of a matrix can be found using iterative methods even for 
moderately large matrices. There is, however, no safe way to recover the solution of 
the combinatorial problem, where a is restricted to values of ±1 from it, but it can be 
argued that for most matrices the eigenvector will constitute a good approximation to 
the combinatorial problem. We thus map the continuous value a; to discrete value a; 
using a threshold I: 

/ -1 if <n < I , v 

a * = \ +1 ifa 4 >/ • (16) 

The threshold can now be determined by minimizing the residual similarity over all 
possible thresholds. In this way, the minimization problem is reduced from n! to just n 
options, and the characteristic valuation serves as a heuristic to determine the options 
that are taken into consideration. 

The measure of residual similarity favors in general splitting off a single point from 
the data set since (|i~3|) contains in this case only n — 1 terms, as compared to n 2 /4 when 
splitting symmetrically. This automatically introduces a quality control in the splits, 
as central splits occur only when the cluster separation is rather favorable, but might 
also hinder the analysis of noisy data. However, the special form ([13]) was only chosen 
to turn the problem into an eigenvalue problem. As the whole procedure is heuristic 
in nature, we may well decide to use a different similarity measure when determining 
the splitting threshold, e.g. a measure that includes a combinatorial factor 



Which measure is correct depends mainly on the application. The original measure 
is stricter in what it returns as a cluster, while the latter measure favors balanced 
splittings. In some problems, like partitioning matrices for processing in a parallel 
computer, one may even demand that each split is symmetrical. 

Another approach taken frequently in cluster analysis is to use the singular-value 
decomposition JOJ. If we go back to the feature matrix Ay and its singular value 
decomposition (|J), it turns out that the vectors correspond to minima of the 
expectation value with respect to the feature matrix squared, i.e. 

UiAik] = ■ Aj)uj (18) 
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where we introduced the row vectors A4 of the matrix Aik, i.e. the feature vector of 
data point i. Thus in this approach the role of the similarity matrix is taken over by 
the scalar-product matrix of the feature vectors. The major differences are 

1. The scalar products are not less than or equal to one, but this could easily be 
fixed by globally rescaling the scalar-product matrix, which does not change the 
vectors vS l > . 

2. The scalar products can be negative. The notion of a scalar-product is not of 
similarity and dissimilarity but rather the trichotomy of similar, orthogonal, and 
antagonistic. 

Thus, singular-value decomposition seems suitable for feature vectors that characterize 
orthogonal qualities. However, this is not the case in our feature vectors, so we chose 
a similarity measure based upon distance. 

After partitioning the data set into two subsets, we proceed to apply the algorithm 
again to these subsets. In this way, one obtains a splitting tree that terminates only 
when the subset size is smaller than three. For many applications, such a tree is already 
quite useful as it orders the data points in such a way to similar data points are usually 
close to each other. 

To identify clusters in the splitting tree, we found that the average width of the 
cluster relative to that of its parent cluster gives the best indication. To calculate the 
average width of a cluster we use the Euclidean distance in the high-dimensional space 
and average over all distinct pairs of points in a cluster. This quantity relative to that 
of the parent cluster basically indicates how much the closer the points are on average 
in the subcluster than in the original cluster and thus how much the split improves 
the cluster criterion. Consider e.g. the situation where there are three clusters. The 
first split will result in one correctly identified cluster and a second pseudo-cluster that 
encompasses the other two, but the relative width of the true cluster will be much 
smaller than that of the pseudo-cluster. Only after the next split it will be revealed 
that the latter consists of two clusters. Typical values for this quantity are between 
0.5 and 0.8. 



Ill Results 

We apply our methods to a molecular dynamics simulation of the molecule adenylyl(3 ? - 
5')cytidylyl(3'-5')cytidin in vacuum jTJJ]. This is a very simple tri-ribonucleotide, con- 
sisting of three residues and 70 (effective) atoms. The simulation was performed using 
the GROMOS96 |~5| extended atom force field. For the analysis, we chose a subset of 
1000 configurations equidistantly from the trajectory. 

Fig. [I] shows the two-dimensional map of the trajectory found by minimizing (JJ). 
The points are connected by a line in the same sequence as they are generated in the 
Monte Carlo simulation. This information does not enter in determining the locations 
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Figure 1: Two-dimensional map of 1000 configurations chosen from a molecular dy- 
namics trajectory 
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Figure 2: Map of the trajectory in the plane spanned by two typical distances in the 
molecule 

of the points in the plane, so the fact that the line segments are rather short indicates 
that point adjacent in the trajectory are mapped to nearby points in the plane and 
thus are recognized as geometrically similar by the algorithm. The one pair of lines 
that crosses nearly the whole plane horizontally is actually made up of the first three 
data points and therefore a transient effect before the molecule became equilibrated. 

We immediately notice that there are at least three clearly different groups of points 
which constitute conformations in a geometrical sense, i.e. subsets of the trajectory with 
similar geometrical properties. That they are also dynamical conformations can be seen 
from the fact that the connecting line of the points only very rarely crosses from one 
point group into the next. This again confirms that the two-dimensional layout in the 
plane chosen by the algorithm represents correctly the dynamics of the system. 

The representation of Fig. [I] can be compared to a representation where the dimen- 
sionality of the system is reduced by chemical understanding. As the system consists 
of three residues, most of the conformational dynamics can be assumed to be in the 
geometrical layout of the residues. This can be described by only three numbers, and 
we chose two of them to create the two-dimensional representation shown in Fig. 
This picture is similar to Fig. [I] in that there appear approximately three distinct point 
groups, and it can be verified that they correspond to the point groups from Fig. [I]. 
However, the separation of the point groups is less clear than in Fig. |l|. This indi- 
cates that the conformational dynamics is not simply the motion of the centers of the 
residues, but there are also smaller rearrangements in the residues themselves that are 
correlated to the large-scale motions. By considering an unbiased measure for geomet- 
rical similarity, all those little rearrangements enter and reinforce the distance between 
conformations in the plane. 
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Figure 3: Different clusters identified in the trajectory by the clustering algorithm. 
Figures a, b, and c show the decomposition of the trajectory into three conformational 
clusters , while d, e, and f show the substructure of one such cluster. 
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III. 1 Cluster analysis 

Applying the cluster algorithm to the similarity matrix of the trajectory, the first few 
splits remove 22 isolated points before the small cluster shown in Fig. |3|a with 40 points 
and a relative width of 0.46 (both compared to its immediate predecessor and to the 
initial point set) shows up. The remaining points are split some steps further into a 
cluster with 698 points shown in Fig. [3]b and another cluster with 230 points shown in 
Fig. |3|c with relative widths of about 0.53. After some more steps, the larger subcluster 
is broken into three subclusters with 388, 52, and 21 points, resp., and relative widths 
of 0.91, 0.73, and 0.61, resp., as shown in Fig. |3]d, e, and f. Similarly, the smaller 
subcluster also separates into three weak subclusters. 

The splitting line of the large cluster at the bottom is also visible in Fig. |l|. Such 
a pattern usually indicates that beside a large conformational change that induces the 
three clearly visible clusters, where the middle cluster is clearly a transitional state, 
there is another smaller conformational change, possibly in one of the glucose rings, 
independent of the larger one. As it only affects a small part of the molecule, the 
conformational distance is smaller and is then imprinted like a fine structure on the 
clusters. That such changes are visible in the plot is an advantage from considering all 
atom coordinates without bias. 



IV Conclusions and Outlook 

We have demonstrated a method for projecting a molecular dynamics trajectory onto 
a plane to capture the conformational structure of the trajectory. Conformations can 
in this way be easily identified by visual inspection. Cluster analysis on the full confor- 
mational distance matrix also revealed these clusters, but also allowed to discern fine 
structure inside the clusters caused by smaller conformational changes. 

To simplify the analysis of a trajectory, we have created a Java application that 
reads the output files of the combined plane mapping/cluster analysis program and 
displays the two-dimensional map. This program interacts directly with an Open In- 
ventor molecular visualization application by means of a Unix pipe. Whenever the user 
selects a point in the plane, the corresponding configuration is shown in the visualiza- 
tion program. The user can also choose to display identified clusters using different 
colors in the map. 

Identifying which parts of the molecule are responsible for different conformations 
is a much more difficult problem. We use a visualization application that allows the 
user to form groups of atoms that are visualized by ellipsoids. In this way, a molecule 
can be easily reduced to its functional groups where it is much easier to spot confor- 
mational changes. However, small conformational changes as those that show up as a 
fine structure on the plane map are easily lost in this representation. 

As a first attempt at aiding the eye in discovering unusual motions of the molecule, 
we implemented a simple OpenGL effect in the visualization application that allows to 
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Figure 4: Representation of the collective motion by an OpenGL fading effect 

blend several frames of an animation in the hope that large changes stand out more 
clearly in this representation. Fig. ^ shows one such picture. Certainly more research 
can be expended on how to identify and visualize the essential degrees of freedoms. 

The concept of essential molecular dynamics has been introduced to reduce the 
number of degrees of freedom in the simulation. Both the plane map and the cluster 
analysis can be used to inflict new coordinates upon the system. For the point map, 
these are simply the x and y positions of the configuration in the plane. Once a certain 
point map has been established, new configurations can be fitted into the plane by 
minimizing the residual distance while keeping all other points fixed. Similarly, the 
cluster analysis assigns to each configuration a position in the tree that can be seen as 
a (discrete) essential coordinate. How such essential coordinates can be reintroduced 
into the dynamics of the system is still an open question. 
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